AITopics | Image Understanding

Image Understanding

"Image understanding (IU) is the research area concerned with the design and experimentation of computer systems that integrate explicit models of a visual problem domain with one or more methods for extracting features from images and one or more methods for matching features with models using a control structure. Given a goal, or a reason for looking at a particular scene, these systems produce descriptions of both the images and the world scenes that the images represent."
– Image Understanding, by J.K. Tsotsos. In Encyclopedia of Artificial Intelligence. Stuart C. Shapiro, editor. 1987. New York: John Wiley & Sons.

MultiScan: Scalable RGBD scanning for 3D environments with articulated objects

Neural Information Processing Systems, May-29-2025, 10:12:58 GMT

We introduce MultiScan, a scalable RGBD dataset construction pipeline leveraging commodity mobile devices to scan indoor scenes with articulated objects and web-based semantic annotation interfaces to efficiently annotate object and part semantics and part mobility parameters. We use this pipeline to collect 273 scans of 117 indoor scenes containing 10957 objects and 5129 parts. The resulting MultiScan dataset provides RGBD streams with per-frame camera poses, textured 3D surface meshes, richly annotated part-level and object-level semantic labels, and part mobility parameters.
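The part mobility parameters mentioned above describe how an articulated part moves. As a minimal sketch, assuming each part stores a motion type, an axis direction, and a point on the axis (the field names here are hypothetical, not MultiScan's actual annotation schema), a rotating part such as a door can be moved with Rodrigues' rotation formula and a sliding part such as a drawer by a translation along its axis:

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(v: Vec3) -> Vec3:
    norm = math.sqrt(_dot(v, v))
    if norm == 0.0:
        raise ValueError("axis must be non-zero")
    return (v[0] / norm, v[1] / norm, v[2] / norm)


@dataclass
class PartMobility:
    """Hypothetical mobility record for one articulated part."""
    motion_type: str  # "rotation" (e.g. a hinged door) or "translation" (e.g. a drawer)
    axis: Vec3        # rotation axis direction or sliding direction
    origin: Vec3      # a point on the rotation axis (ignored for translation)

    def move_point(self, p: Vec3, amount: float) -> Vec3:
        """Move point p by `amount` (radians for rotation, scene units for translation)."""
        k = _normalize(self.axis)
        if self.motion_type == "translation":
            return (p[0] + amount * k[0], p[1] + amount * k[1], p[2] + amount * k[2])
        if self.motion_type == "rotation":
            # Rodrigues' formula applied to the offset from the axis origin.
            v = (p[0] - self.origin[0], p[1] - self.origin[1], p[2] - self.origin[2])
            c, s = math.cos(amount), math.sin(amount)
            kxv, kdv = _cross(k, v), _dot(k, v)
            r = [v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c) for i in range(3)]
            return (r[0] + self.origin[0], r[1] + self.origin[1], r[2] + self.origin[2])
        raise ValueError(f"unknown motion type: {self.motion_type}")
```

For example, `PartMobility("rotation", (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)).move_point((1.0, 0.0, 0.0), math.pi / 2)` gives approximately `(0.0, 1.0, 0.0)`, a quarter turn about the vertical axis.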

artificial intelligence, machine learning, segmentation

Neural Information Processing Systems

Country:

North America (0.14)
Asia (0.14)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)

Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model

Neural Information Processing Systems, May-29-2025, 08:49:00 GMT

Existing multi-modal image fusion methods fail to address the compound degradations present in source images, resulting in fused images plagued by noise, color bias, improper exposure, and similar defects. Additionally, these methods often overlook the specificity of foreground objects, weakening the salience of objects of interest within the fused images. To address these challenges, this study proposes a novel interactive multi-modal image fusion framework based on a text-modulated diffusion model, called Text-DiFuse.

artificial intelligence, image understanding, machine learning

Neural Information Processing Systems

Country: Asia > China (0.14)

Genre: Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.86)

Bidirectional Recurrence for Cardiac Motion Tracking with Gaussian Process Latent Coding

Jiewen Yang, Yiqun Lin, Bin Pu, Xiaomeng Li

Neural Information Processing Systems, May-29-2025, 05:38:35 GMT

Quantitative analysis of cardiac motion is crucial for assessing cardiac function. This analysis typically uses imaging modalities such as MRI and echocardiography, which capture detailed image sequences throughout the heartbeat cycle. Previous methods have predominantly analyzed image pairs without considering motion dynamics or spatial variability. Consequently, these methods often overlook the long-term temporal relationships and the regional characteristics of cardiac motion. To overcome these limitations, we introduce GPTrack, a novel unsupervised framework designed to fully explore the temporal and spatial dynamics of cardiac motion. GPTrack enhances motion tracking by employing a sequential Gaussian Process in the latent space and encoding statistics with spatial information at each time stamp, which robustly promotes the temporal consistency and spatial variability of cardiac dynamics.
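The idea of a Gaussian Process over latent codes across time stamps can be illustrated, in much simplified form, by GP regression on a single latent coordinate observed at successive frames. This sketch is not GPTrack's model: it assumes a zero-mean GP with a squared-exponential kernel, the `length_scale` and `noise` values are illustrative, and each latent dimension would be smoothed independently:

```python
import math
from typing import List, Sequence


def rbf(t1: float, t2: float, length_scale: float) -> float:
    """Squared-exponential covariance between two time stamps."""
    return math.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)


def solve(a: List[List[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def gp_smooth(times: Sequence[float], values: Sequence[float],
              query: Sequence[float], length_scale: float = 1.0,
              noise: float = 0.1) -> List[float]:
    """Zero-mean GP posterior mean: k(t*, T) (K + noise^2 I)^-1 y."""
    n = len(times)
    k = [[rbf(times[i], times[j], length_scale) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(k, values)
    return [sum(rbf(q, times[i], length_scale) * alpha[i] for i in range(n))
            for q in query]
```

A larger `noise` pulls the smoothed trajectory further from individual frames, trading fidelity for temporal consistency; far from any observed frame the posterior mean falls back to the zero prior.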

artificial intelligence, image understanding, machine learning

Neural Information Processing Systems

Country:

Europe (0.92)
Asia > China (0.28)
North America > Canada > Quebec (0.14)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.86)

Multistable Shape from Shading Emerges from Patch Diffusion

Neural Information Processing Systems, May-29-2025, 05:33:56 GMT

Models for inferring the monocular shape of surfaces with diffuse reflection (shape from shading) ought to produce distributions of outputs, because there are fundamental mathematical ambiguities of both continuous (e.g., bas-relief) and discrete (e.g., convex/concave) types that are also experienced by humans. Yet the outputs of current models are limited to point estimates or tight distributions around single modes, which prevents them from capturing these effects. We introduce a model that reconstructs a multimodal distribution of shapes from a single shading image, which aligns with the human experience of multistable perception. We train a small denoising diffusion process to generate surface normal fields from 16×16 patches of synthetic images of everyday 3D objects.
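The discrete convex/concave ambiguity can be reproduced with a textbook Lambertian shading model. This sketch illustrates the ambiguity only and is not the paper's diffusion model; it assumes orthographic viewing, unit albedo, a distant light, and a spherical cap as the test shape, on a 16×16 grid to match the patch size above:

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def lambertian(n: Vec3, light: Vec3) -> float:
    """Diffuse shading with unit albedo: I = max(0, n . l)."""
    return max(0.0, n[0] * light[0] + n[1] * light[1] + n[2] * light[2])


def bump_normals(size: int, concave: bool) -> List[List[Vec3]]:
    """Unit normals of a unit-sphere cap viewed along +z: a bump or, if concave, a dent."""
    sign = -1.0 if concave else 1.0
    normals = []
    for i in range(size):
        y = -0.7 + 1.4 * i / (size - 1)
        row = []
        for j in range(size):
            x = -0.7 + 1.4 * j / (size - 1)
            z = math.sqrt(1.0 - x * x - y * y)  # real because x^2 + y^2 <= 0.98
            # A dent flips the in-plane normal components but keeps z toward the viewer.
            row.append((sign * x, sign * y, z))
        normals.append(row)
    return normals


def render(normals: List[List[Vec3]], light: Vec3) -> List[List[float]]:
    """Shade every pixel of a normal field under one distant light."""
    return [[lambertian(n, light) for n in row] for row in normals]
```

Because each pixel depends only on n · l, a bump lit from (lx, ly, lz) and a dent lit from (-lx, -ly, lz) render to identical images, so a single patch cannot distinguish the two shapes.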

artificial intelligence, computer vision, machine learning

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.67)

Unsupervised Foreground Extraction via Deep Region Competition

Neural Information Processing Systems, May-29-2025, 05:33:02 GMT

We present Deep Region Competition (DRC), an algorithm designed to extract foreground objects from images in a fully unsupervised manner. Foreground extraction can be viewed as a special case of generic image segmentation that focuses on identifying and disentangling objects from the background. In this work, we rethink foreground extraction by reconciling an energy-based prior with generative image modeling in the form of a Mixture of Experts (MoE), into which we further introduce learned pixel re-assignment as the essential inductive bias for capturing the regularities of background regions. With this modeling, the foreground-background partition can be found naturally through Expectation-Maximization (EM). We show that the proposed method effectively exploits the interaction between the mixture components during the partitioning process, which closely connects to region competition [1], a seminal approach for generic image segmentation. Experiments demonstrate that DRC achieves more competitive performance than prior methods on complex real-world data and challenging multi-object scenes. Moreover, we show empirically that DRC can potentially generalize to novel foreground objects, even from categories unseen during training.
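The EM-based partition can be illustrated with a much simpler classical stand-in; this is not DRC. The sketch fits a two-component Gaussian mixture to grayscale pixel intensities by EM and assumes, for illustration only, that the brighter component is the foreground. It shows how the E-step and M-step let the two components compete for pixels:

```python
import math
from typing import List, Sequence, Tuple


def _log_gauss(x: float, mean: float, var: float) -> float:
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def _sigmoid(d: float) -> float:
    # Numerically stable logistic function.
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def em_two_regions(pixels: Sequence[float], iters: int = 50,
                   min_var: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Fit a 2-component 1-D Gaussian mixture; return (P(component 1 | pixel), means)."""
    n = len(pixels)
    mu = sum(pixels) / n
    var0 = max(sum((x - mu) ** 2 for x in pixels) / n, min_var)
    means = [min(pixels), max(pixels)]  # component 0 starts dark, component 1 bright
    variances = [var0, var0]
    weights = [0.5, 0.5]
    resp: List[float] = []
    for _ in range(iters):
        # E-step: posterior probability that each pixel belongs to component 1.
        resp = [_sigmoid((math.log(weights[1]) + _log_gauss(x, means[1], variances[1]))
                         - (math.log(weights[0]) + _log_gauss(x, means[0], variances[0])))
                for x in pixels]
        # M-step: re-estimate weights, means, and variances from the soft assignments.
        for k, r in ((0, [1.0 - q for q in resp]), (1, resp)):
            nk = max(sum(r), 1e-12)
            weights[k] = nk / n
            means[k] = sum(ri * x for ri, x in zip(r, pixels)) / nk
            variances[k] = max(sum(ri * (x - means[k]) ** 2
                                   for ri, x in zip(r, pixels)) / nk, min_var)
    return resp, means
```

Thresholding the returned responsibilities at 0.5 yields a foreground mask; DRC replaces these per-pixel Gaussians with learned generative experts and an energy-based prior.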

artificial intelligence, machine learning, proceedings

Neural Information Processing Systems

Country:

North America > United States (0.14)
Asia > China (0.14)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.93)

Bridge the Modality and Capability Gaps in Vision-Language Model Selection

Chao Yi, Yu-Hang He, De-Chuan Zhan, Han-Jia Ye

Neural Information Processing Systems, May-29-2025, 05:23:19 GMT

Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of pre-trained VLMs increases the likelihood of identifying a suitable VLM for a specific task. To better reuse VLM resources and fully leverage their potential on different zero-shot image classification tasks, a promising strategy is to select appropriate pre-trained VLMs from a VLM Zoo, relying solely on the text data of the target dataset without access to its images. In this paper, we analyze two inherent challenges in assessing a VLM's ability in this Language-Only VLM selection setting: the "Modality Gap", the disparity between a VLM's embeddings of the two modalities, which makes text a less reliable substitute for images; and the "Capability Gap", the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders directly predicting a model's dataset-specific performance from its general performance.
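Zero-shot classification by pairing images with category names, and the "Modality Gap" described above, can be sketched with plain embedding vectors. This assumes the image and text encoders have already produced embeddings, and it measures the gap as the distance between the centroids of the L2-normalized image and text embeddings, one common way to quantify it that may differ from the paper's own measure:

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def zero_shot_predict(image_emb: Sequence[float],
                      class_text_embs: Dict[str, Sequence[float]]) -> str:
    """Pick the class whose text embedding is most cosine-similar to the image."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))


def centroid_gap(image_embs: Sequence[Sequence[float]],
                 text_embs: Sequence[Sequence[float]]) -> float:
    """Distance between the centroids of L2-normalized image and text embeddings."""
    def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
        unit = [l2_normalize(v) for v in vectors]
        return [sum(u[i] for u in unit) / len(unit) for i in range(len(unit[0]))]
    ci, ct = centroid(image_embs), centroid(text_embs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ci, ct)))
```

A large centroid gap means text embeddings sit in a different region of the shared space than image embeddings, which is why text alone is an imperfect proxy when ranking VLMs without images.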

large language model, machine learning, natural language

Neural Information Processing Systems

Country: Asia (0.14)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.55)

A Appendix

Neural Information Processing Systems, May-29-2025, 05:02:58 GMT

A.1 Data, Models, and Model Accuracies

Images are pixel-wise normalized by the mean and standard deviation of the training images for each dataset, and for ImageNet all images are center cropped and resized to 224×224; this preprocessing is done before any interpolating paths are constructed. Our implementations are based on Cubuk et al. [6]; we use the same optimizer (stochastic gradient descent with momentum) and cosine learning rate schedule. We train without data augmentation (to ensure all models are trained on exactly the same examples), except for experiments that explicitly vary data augmentation. Without data augmentation, the test accuracies of our models are shown in Table 1. The top-1 classification accuracies of these models are presented in Table 2.

Table: Model | ImageNet Test Accuracy (%)

A.2 Linear Interpolation: Methodological Details

For each sampled path, we compute the discrete Fourier transform (DFT) of the prediction function along the path separately for each of the M class predictions, take the (real) magnitude of the resulting (complex) DFT coefficients, and average them among the M classes.
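The A.2 procedure (evaluate the model along an interpolation path, take the DFT of each of the M class outputs, keep the magnitudes, and average over classes) can be sketched as follows. This assumes a straight-line path between two inputs with evenly spaced sample points and uses a direct O(n²) DFT for self-containment; `toy_softmax_model` is a hypothetical stand-in for a trained classifier, and the default of 64 path points is arbitrary:

```python
import cmath
import math
from typing import Callable, List, Sequence

Vector = List[float]


def linear_path(x0: Sequence[float], x1: Sequence[float], n_points: int) -> List[Vector]:
    """Points x(t) = (1 - t) x0 + t x1 for n_points evenly spaced t in [0, 1]."""
    ts = [k / (n_points - 1) for k in range(n_points)]
    return [[(1 - t) * a + t * b for a, b in zip(x0, x1)] for t in ts]


def dft_magnitude(signal: Sequence[float]) -> List[float]:
    """|DFT| of a real signal, computed directly in O(n^2)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n)]


def path_spectrum(predict: Callable[[Vector], Sequence[float]],
                  x0: Sequence[float], x1: Sequence[float],
                  n_points: int = 64) -> List[float]:
    """Average |DFT| over the M class outputs of `predict` along the path."""
    preds = [predict(x) for x in linear_path(x0, x1, n_points)]
    m = len(preds[0])
    spectra = [dft_magnitude([p[c] for p in preds]) for c in range(m)]
    return [sum(s[k] for s in spectra) / m for k in range(n_points)]


def toy_softmax_model(x: Vector) -> List[float]:
    """Hypothetical 2-class linear softmax model, used only for illustration."""
    logits = [3.0 * x[0] - x[1], -2.0 * x[0] + x[1]]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For instance, `path_spectrum(toy_softmax_model, [0.0, 0.0], [1.0, 1.0])` returns 64 averaged magnitudes; a prediction that changes abruptly along the path puts more weight on the higher-frequency entries. In practice an FFT routine would replace the direct DFT.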

data quality, frequency, machine learning

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (0.69)
Information Technology > Data Science > Data Quality > Data Transformation (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

2cd5737c59645f7ef23b2842b705edf2-Paper-Conference.pdf

Neural Information Processing Systems, May-29-2025, 02:50:18 GMT

artificial intelligence, machine learning, prediction

Neural Information Processing Systems

Country: North America > United States > California (0.14)

Genre: Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.34)

Appendix A Proof of H(ɛ) increase = g(ɛ, y), and assume that (ɛ, x, y) (x, y

Neural Information Processing Systems, May-29-2025, 02:21:38 GMT

Let ɛ be a noise variable independent of X and Y, and let g: ɛ × Y → Y be the augmentation function. If H(Y|X) is less than H(ɛ) away from the maximum entropy distribution, the one-to-one condition cannot be satisfied, and the theorem does not apply. This is generally not a concern in practice, since standard ML setups have exactly one label y per example x. This codebase applies random crops and random left-right flips to the training images. Training a ResNet-50 baseline achieves 76% top-1 accuracy after 60 epochs.
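Random crops and left-right flips are standard training-time augmentations. A minimal sketch, assuming a single-channel image stored as nested lists, a crop without padding, and a flip probability of 0.5 (the codebase's actual crop size, padding, and resizing are not given here):

```python
import random
from typing import List

Image = List[List[float]]  # single-channel image stored as rows of pixel values


def random_crop(img: Image, size: int, rng: random.Random) -> Image:
    """Return a size x size window at a uniformly random position."""
    h, w = len(img), len(img[0])
    if size > h or size > w:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, h - size)   # randint is inclusive at both ends
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]


def random_hflip(img: Image, rng: random.Random, p: float = 0.5) -> Image:
    """Mirror the image left-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in img]
    return [row[:] for row in img]


def augment(img: Image, size: int, rng: random.Random) -> Image:
    """Crop, then possibly flip, drawing both choices from the same generator."""
    return random_hflip(random_crop(img, size, rng), rng)
```

Passing an explicit `random.Random` instance keeps augmentation reproducible across runs, which matters when comparing models trained on the same example stream.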

artificial intelligence, image understanding, machine learning

Neural Information Processing Systems

Technology: